Abstract

We analyze a learning method that applies a margin $\kappa$ à la Gardner to simple
perceptron learning. The method reduces to perceptron learning when $\kappa = 0$ and
to Hebbian learning when $\kappa \to \infty$. Nevertheless, we found that its
generalization ability is superior to that of both perceptron and Hebbian learning at
an early stage of learning. We analyzed the asymptotic property of the learning curve
of this method through computer simulation and found that it is the same as for
perceptron learning. We also investigated an adaptive margin control method.

Keywords

On-line learning, Margin, Simple perceptron, Generalization ability, Perceptron

Symbols

x : An input
N : Dimension of an input
B : Teacher's weight vector
J : Student's weight vector
v : Teacher's total input
ul : Student's total input
u : Normalized student's total input
sgn(·) : Sign function
$\Theta(\cdot)$ : Threshold function
$\kappa$ : Margin
R : Overlap between teacher and student weight vectors
l : Student's weight vector length
$\epsilon_g$ : Generalization error
$\varphi$ : Angle between teacher and student weight vectors
P(u,v) : Distribution of v and u in the limit $N \to \infty$

1 Introduction

Applying a margin to the decision boundary improves the generalization ability of
many algorithms, for example, the support vector machine (SVM). Generalization
ability is defined as the ability to classify samples that have not previously been
learned. The SVM places the decision boundary where the margin is maximized,
and the support vectors are the learning samples closest to the decision boundary. In
other words, the SVM improves the generalization ability by maximizing the margin
with respect to the learning samples already learned. Improving generalization ability
is a key step towards solving the learning problem, and we believe that incorporating
another form of learning - which we do by introducing the margin - can improve
generalization ability.

Statistical mechanics has been used to analyze the learning ability of feed-forward
neural networks and the effectiveness of information processing such as image
restoration. Its methods are often used to study the dynamics of learning or the
ability of neural networks because they can depict the macroscopic dynamics of a
system. The on-line learning of the simple perceptron, which consists of an input
layer and an output unit, has been extensively studied using this approach. In on-line
learning, the network parameters are modified when a learning sample is presented,
and this sample is not used for further learning.

Perceptron learning [1]~[3] is a learning method applied to the simple
perceptron. Learning occurs when the sign of the student's output differs from that
of the teacher's output for an input.

Requiring that the absolute value of the total input be larger than some margin,
even if the sign of the student's output agrees with that of the teacher's output,
would be beneficial because a learning sample whose total input has a smaller absolute
value than the margin is near the class boundary and can easily be moved to another
class by noise. Therefore, we propose a learning method in which a margin is
applied à la Gardner [1] to a simple perceptron. Rosenblatt [5] used similar learning
algorithms.

Our algorithm is equivalent to perceptron learning when the margin is zero and 
equivalent to Hebbian learning when the margin is infinity. The method is thus 
intermediate between these two learning methods. We analyzed the dynamics of 
our algorithm through the statistical-mechanical method. The dynamics of this 
learning method seems to be intermediate between those of perceptron learning and 
those of Hebbian learning. Surprisingly, though, our learning algorithm is superior 
to the perceptron learning and the Hebbian learning in terms of the generalization 
error in the early stage of learning. 

In Section 2, we review the theory of on-line learning, explain the generalization 
error, and give the order parameters we used to depict the learning dynamics. In 
Section 3, we explain the formulation of our algorithm by showing the learning equa- 
tion we use and deriving coupled macroscopic differential equations. These coupled 
differential equations are solved in Section 4, and the dynamics of our method are 
obtained. The dependence of the generalization error on the margin in our algo- 
rithm is also discussed. In Sections 5 and 6, respectively, we discuss the asymptotic 
property of our algorithm and the adaptive margin control method.

2 THEORY OF ON-LINE LEARNING

2.1 Simple perceptron 

In this paper, we evaluate the learning ability by using the teacher-student
formulation. The teacher outputs the answer to the input. Learning ability is evaluated
by how close the student's output is to the teacher's output. The teacher and the
student are simple perceptrons with the same structure, as shown in Fig. 1. We assume
that the input x is randomly selected according to a probability distribution P(x).
The teacher outputs sgn(v) for the input $x = \{x_1, \ldots, x_N\}$:

$$\mathrm{sgn}(v) = \mathrm{sgn}\Big(\sum_{j=1}^{N} B_j x_j\Big) = \mathrm{sgn}(\boldsymbol{B} \cdot \boldsymbol{x}) \tag{1}$$

$$\boldsymbol{B} = \{B_1, \ldots, B_N\} \tag{2}$$

Here, sgn(x) denotes the sign function that outputs 1 when x > 0 and outputs -1
when x < 0. The student outputs sgn(ul) for the input x in the same way as the
teacher:

$$\mathrm{sgn}(ul) = \mathrm{sgn}\Big(\sum_{j=1}^{N} J_j x_j\Big) = \mathrm{sgn}(\boldsymbol{J} \cdot \boldsymbol{x}) \tag{3}$$

$$\boldsymbol{J} = \{J_1, \ldots, J_N\} \tag{4}$$

Here, l is a proportional multiplier.

In the learning process, the student updates its weight vector according to the
equation

$$\boldsymbol{J}^{m+1} = \boldsymbol{J}^{m} + f(v, u, l)\,\boldsymbol{x}. \tag{5}$$

Here, m is the learning iteration number, and f(v,u,l) is a function determined by the
learning algorithm used in the learning process.

2.2 Assumptions 

There are two types of learning procedures: off-line learning and on-line learning.
In off-line learning, all the learning samples used in the learning are prepared
beforehand. The samples are fixed throughout the learning process, and the student
iterates the update of Eq. (5) over them. Here, a learning sample is the pair of an
input x and the corresponding teacher's output. In on-line learning, the student
updates the weight vector by using a single sample, and that sample is not used again
in subsequent learning. In this case, the input x and the student weight vector J
become statistically independent when N is sufficiently large, and the analysis
becomes easy. Therefore, in this paper, we discuss learning ability based on on-line
learning.

We consider the thermodynamic limit $N \to \infty$ in the following discussions.
The teacher's weight vector B is generated from random numbers taken from a
Gaussian distribution of mean zero and unit variance. When N is sufficiently large,
the size of the weight vector becomes $|B| = \sqrt{N}$. The student weight vector J is
generated in the same way as the teacher's weight vector. Because the size of the
student weight vector changes when J is updated, we write it as $|J| = l\sqrt{N}$ with
the proportional multiplier l and assume that l is finite. The elements of the input
vector are also generated from random numbers taken from a Gaussian distribution of
mean zero and variance $1/N$; that is, $|x| = 1$ when N is sufficiently large. From the
above formulation, self-averaging can be assumed. In the next paragraph, we briefly
explain this self-averaging.
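The scaling assumptions above can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function names (`gaussian_vector`, `scaling_check`) and the sample sizes are illustrative choices. It draws B with unit-variance components and inputs x with variance 1/N, and reports |B|/sqrt(N), |x|, and the sample variance of v = B·x, all of which should be close to 1 for large N.

```python
import math
import random

def gaussian_vector(n, std, rng):
    """Draw n independent zero-mean Gaussian components with the given std."""
    return [rng.gauss(0.0, std) for _ in range(n)]

def norm(vec):
    return math.sqrt(sum(c * c for c in vec))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def scaling_check(n=1000, samples=300, seed=0):
    """Return |B|/sqrt(N), the norm of one input x, and the sample variance of v."""
    rng = random.Random(seed)
    B = gaussian_vector(n, 1.0, rng)                      # teacher, unit variance
    xs = [gaussian_vector(n, 1.0 / math.sqrt(n), rng)     # inputs, variance 1/N
          for _ in range(samples)]
    vs = [dot(B, x) for x in xs]                          # teacher's total input
    var_v = sum(v * v for v in vs) / samples              # mean of v is zero
    return norm(B) / math.sqrt(n), norm(xs[0]), var_v

if __name__ == "__main__":
    print(scaling_check())
```

The deviations from 1 shrink roughly as 1/sqrt(N), which is the sense in which the limit $N \to \infty$ makes these quantities deterministic.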

First, consider the case in which J and x are of the same order of size. When an
input x is presented, whether the teacher and student outputs have the same sign is
a statistical event. In other words, the student weight vector J will be
updated to either J + x or J. Thus, there are $2^m$ possible states of J after the
m-th learning iteration. However, $2^m$ statistical variables are too many to handle.
Hence, we assume that the sizes of J and B are large compared with that of x. Many
learning samples are then required to increase the student weight vector length from
$l$ to $l + dl$, so to depict the trajectory of $l$ we need to consider only the
statistical effects of the inputs. This treatment is called self-averaging in
statistical mechanics. We introduced the above formulation to make the problem easier
to handle in this manner.

There are two reasons for assuming the thermodynamic limit in the learning
theory. The first is that deterministic differential equations of the order
parameters l and R can be derived because the central limit theorem can be used
at the thermodynamic limit. For example, the differential equation of l is derived
from Eq. (5); however, x and J are random variables, so the equation is
a random recurrence formula. The random variables u and v, defined by
$ul = \boldsymbol{J} \cdot \boldsymbol{x}$ and $v = \boldsymbol{B} \cdot \boldsymbol{x}$,
follow the Gaussian distribution P(u,v) of zero mean and unit variance when the input
x is independent of the weight vector J and the central limit theorem can be
applied. Second, the generalization error is calculated by averaging the error with
respect to the input distribution P(x).

$$\epsilon_g = \int d\boldsymbol{x}\, P(\boldsymbol{x}) \big(\mathrm{sgn}(\boldsymbol{B} \cdot \boldsymbol{x}) - \mathrm{sgn}(\boldsymbol{J} \cdot \boldsymbol{x})\big)^2 \tag{6}$$

In general, this calculation is difficult because it requires an N-fold multiple
integral. However, the random variables u and v follow the Gaussian distribution
P(u,v) because the central limit theorem can be used at the thermodynamic limit, so
the generalization error can be calculated by averaging the error over the
two-dimensional Gaussian distribution P(u,v).

$$\epsilon_g = \int du\, dv\, P(u, v) \big(\mathrm{sgn}(v) - \mathrm{sgn}(ul)\big)^2 \tag{7}$$

Moreover, the generalization error is calculated by using the direction cosine R of
J and B, as shown by Eq. (13) in Sec. 2.4. As shown above, by assuming the
thermodynamic limit, we can calculate the generalization error $\epsilon_g$.

2.3 Conventional on-line learning algorithms 

In Hebbian learning, the weight vector J is updated using the equation

$$\boldsymbol{J}^{m+1} = \boldsymbol{J}^{m} + \mathrm{sgn}(v)\,\boldsymbol{x}, \tag{8}$$

where m is the iteration number. Using Eq. (8), the weight vector $J^m$ is updated
according to the teacher output sgn(v). The student input potential u is not used
to update the student weight vector J. In perceptron learning, the updating rule is

$$\boldsymbol{J}^{m+1} = \boldsymbol{J}^{m} + \Theta(-uv)\,\mathrm{sgn}(v)\,\boldsymbol{x}. \tag{9}$$

In this equation, the function $\Theta(x)$ returns 1 when x > 0 and returns 0 when
x < 0. The use of this function in perceptron learning means that the weight vector
is updated when the sign differs between the student's and the teacher's output.
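The two conventional rules can be written as choices of the function f(v,u,l) in Eq. (5). The following is a minimal sketch; the function names are illustrative and do not come from the source.

```python
def sgn(z):
    """Sign function: +1 for z > 0, -1 otherwise."""
    return 1.0 if z > 0 else -1.0

def theta(z):
    """Threshold function: 1 for z > 0, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

def f_hebbian(v, u, l):
    """Eq. (8): always move along sgn(v); the student's input u is ignored."""
    return sgn(v)

def f_perceptron(v, u, l):
    """Eq. (9): move along sgn(v) only when student and teacher disagree."""
    return theta(-u * v) * sgn(v)

def online_step(J, x, f_value):
    """Eq. (5): J^{m+1} = J^m + f * x, returned as a new list."""
    return [Jj + f_value * xj for Jj, xj in zip(J, x)]
```

For example, `online_step(J, x, f_perceptron(v, u, l))` leaves J unchanged whenever u and v have the same sign, whereas the Hebbian rule changes J on every sample.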

2.4 Relationship between generalization error and direction 
cosine 

The generalization error $\epsilon_g$ is used as a criterion for the quality of the
learning. In on-line learning, the generalization error is defined as the probability
that a student who has learned m learning samples will answer with an output different
from the teacher's when the (m + 1)-th input is presented. The overlap R is one of the
order parameters used to describe the dynamics of the generalization error. The overlap
is the direction cosine of the weight vectors of the teacher and the student, which is
defined as

$$R = \frac{\boldsymbol{B} \cdot \boldsymbol{J}}{|\boldsymbol{B}|\,|\boldsymbol{J}|} = \frac{1}{Nl} \sum_{j=1}^{N} B_j J_j. \tag{10}$$

In Fig. 2, the teacher weight vector B and the student weight vector J are
depicted for an input dimension of N = 2. The angle between B and J is denoted
by $\varphi$. The input x is normalized as |x| = 1, so the inputs are distributed on
the circumference of a circle with unit radius. The teacher output is
$\mathrm{sgn}(v) = \mathrm{sgn}(\boldsymbol{B} \cdot \boldsymbol{x})$, so the input
space is separated into two regions by a line orthogonal to the teacher
weight vector B. This line forms the class boundary. In the same way, the student
output is $\mathrm{sgn}(ul) = \mathrm{sgn}(\boldsymbol{J} \cdot \boldsymbol{x})$, so the
input space is separated into two regions by a line orthogonal to the student weight
vector J. This line forms the decision boundary.

Since $\varphi$ is the angle between the teacher weight vector B and the student weight
vector J, the teacher and the student outputs differ in the areas defined by the
thick arcs along the unit circle in Fig. 2. We assume that the input x is selected
at random from the circumference of the unit circle, so the generalization error
$\epsilon_g$ is given by the ratio of the length of the thick arcs to the circumference
of the unit circle:

$$\epsilon_g = \frac{\varphi}{\pi}. \tag{11}$$

$\varphi$ can be calculated as

$$\varphi = \cos^{-1} R. \tag{12}$$

From Eqs. (11) and (12), the generalization error is given by

$$\epsilon_g = \frac{1}{\pi} \cos^{-1} R. \tag{13}$$

Therefore, we can calculate the generalization error by using the overlap R instead
of Eq. (6).

Another aspect of the relationship between the generalization error $\epsilon_g$ and
the overlap R is as follows. As explained, the generated input x is statistically
independent of the teacher weight vector B. The student weight vector J and the new
input x are also statistically independent. The distribution of the total input of
the teacher v and the normalized total input of the student u therefore becomes
a Gaussian distribution with mean zero, unit variance, and correlation R. This
distribution is denoted P(u,v) and is written as

$$P(u, v) = \frac{1}{2\pi\sqrt{1 - R^2}} \exp\left(-\frac{u^2 + v^2 - 2Ruv}{2(1 - R^2)}\right), \tag{14}$$

where R is the overlap defined by Eq. (10).

Figure 3(a) depicts P(u,v) when the overlap R equals zero. The abscissa is
the total input of the teacher and the ordinate is the total input of the student.
Because R = 0, Eq. (14) shows that the distribution is circularly symmetric, as shown
in Fig. 3(a). An error occurs in the regions marked by the oblique lines, where the
signs of v and u differ. On the other hand, as the learning progresses and the overlap
R approaches one, the distribution P(u,v) asymptotically concentrates on the line
v - u = 0. This is depicted in Fig. 3(b). Again, an error occurs in the area marked by
the oblique lines, but this area is much smaller in Fig. 3(b) than in Fig. 3(a). This
confirms that as the overlap R approaches one, the generalization error approaches zero.
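The relation between the overlap and the generalization error can be checked by sampling. The sketch below is illustrative only (function names and sample counts are assumptions of this example): it evaluates $\epsilon_g = \varphi/\pi$, with $\varphi$ the angle whose cosine is R, and compares it with the fraction of sampled pairs (u, v) of correlation R whose signs differ.

```python
import math
import random

def generalization_error(R):
    """eps_g = arccos(R) / pi for an overlap R in [-1, 1]."""
    return math.acos(R) / math.pi

def sampled_error(R, samples=50000, seed=1):
    """Fraction of pairs (u, v) with unit variance and correlation R with u*v < 0."""
    rng = random.Random(seed)
    s = math.sqrt(1.0 - R * R)
    errors = 0
    for _ in range(samples):
        v = rng.gauss(0.0, 1.0)
        u = R * v + s * rng.gauss(0.0, 1.0)   # correlated with v by construction
        if u * v < 0:
            errors += 1
    return errors / samples

if __name__ == "__main__":
    for R in (0.0, 0.6, 0.95):
        print(R, generalization_error(R), sampled_error(R))
```

The sampled fraction agrees with the formula up to sampling noise of order 1/sqrt(samples).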



3 FORMULATION 



3.1 Learning equations of proposed algorithm 



The learning equation discussed in this paper is

$$\boldsymbol{J}^{m+1} = \boldsymbol{J}^{m} + \Theta\Big(\kappa - \Big(\sum_j J_j x_j\Big)\, \mathrm{sgn}\Big(\sum_j B_j x_j\Big)\Big)\, \mathrm{sgn}(v)\, \boldsymbol{x}, \tag{15}$$

where m is the number of iterations and $\kappa$ is the margin. In perceptron learning,
if $(\sum_j J_j x_j)\,\mathrm{sgn}(\sum_j B_j x_j) > 0$, no learning occurs because the
signs of the teacher and the student outputs are the same. However, it is a good idea
to require that $(\sum_j J_j x_j)\,\mathrm{sgn}(\sum_j B_j x_j)$ be larger than the
margin even if the signs of v and u are the same, because an input whose total input is
smaller than the margin will be near the class boundary and can easily be moved to
another class by noise. To do this, we introduced the margin $\kappa$ into perceptron
learning as shown in Eq. (15). By rewriting Eq. (15), we derived Eqs. (16) and (17):

$$\boldsymbol{J}^{m+1} = \boldsymbol{J}^{m} + f(v, u, l)\,\boldsymbol{x} \tag{16}$$

$$f(v, u, l) = \Theta\big(-(l\,\mathrm{sgn}(v)\,u - \kappa)\big)\,\mathrm{sgn}(v) \tag{17}$$

We will explain the effect of the margin $\kappa$ by using Fig. 4, which shows
the distribution P(u,v) when the overlap R = 0. In this figure, the abscissa
is the total input of the teacher and the ordinate is the total input of
the student. In the regions marked by the oblique lines, the signs of v and u differ.
Learning occurs in these regions when perceptron learning is used, but does not occur
in the other regions. When Hebbian learning is used, learning occurs in all the
regions. The dashed lines depict the margin $\kappa$. Our algorithm also enables
learning when the absolute value of the total input of the student |ul| is below the
dashed line.

On-Line Learning with a margin

As shown in Fig. 4, when $\kappa = 0$, our algorithm enables learning within the
regions marked by the oblique lines - the same learning region as for perceptron
learning. This can be shown by substituting $\kappa = 0$ into Eq. (17):

$$f(v, u, l) = \Theta(-l u\,\mathrm{sgn}(v))\,\mathrm{sgn}(v) = \Theta(-uv)\,\mathrm{sgn}(v). \tag{18}$$

On the other hand, as Fig. 4 shows, when $\kappa$ is infinite, our method enables
learning in all the regions, as is the case with Hebbian learning. Since
$\Theta(\infty) = 1$, Eq. (17) can be rewritten as

$$f(v, u, l) = \mathrm{sgn}(v), \tag{19}$$

which shows that our algorithm is equivalent to Hebbian learning when $\kappa$ is
infinite. Thus, our algorithm represents an intermediate form between perceptron and
Hebbian learning.

When $0 < \kappa < \infty$, the learning occurs as follows. From Eq. (17), our
algorithm applies the Hebbian update when the absolute value of the total input |ul| is
below the margin $\kappa$, even if the signs of the teacher's and the student's outputs
are the same. Thus, our algorithm enables learning in regions where perceptron learning
does not occur. We therefore expect this method to achieve a generalization
ability better than that of perceptron learning.
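The margin rule of Eq. (17) and its two limits can be illustrated in a few lines. This is a minimal sketch with illustrative names (`f_margin` is not a name used in the source); `math.inf` stands in for the limit of an infinite margin.

```python
import math

def sgn(z):
    """Sign function: +1 for z > 0, -1 otherwise."""
    return 1.0 if z > 0 else -1.0

def theta(z):
    """Threshold function: 1 for z > 0, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

def f_margin(v, u, l, kappa):
    """Eq. (17): f = Theta(kappa - l*u*sgn(v)) * sgn(v)."""
    return theta(kappa - l * u * sgn(v)) * sgn(v)

if __name__ == "__main__":
    v, u, l = 1.0, 0.2, 2.0                  # student agrees with teacher, |ul| = 0.4
    print(f_margin(v, u, l, 0.0))            # perceptron limit: no update
    print(f_margin(v, u, l, 1.0))            # inside the margin: update
    print(f_margin(v, u, l, math.inf))       # Hebbian limit: always update
```

The middle case is the one that distinguishes the method from perceptron learning: the student's answer is already correct, but its total input lies within the margin, so the weight vector is still updated.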
3.2 Differential equations of learning dynamics

Next, we will derive and analyze the coupled differential equations of the overlap R
and the length of the student weight vector l. The overlap R is the direction cosine of
the teacher weight vector B and the student weight vector J.

As discussed in Sec. 2.2, we assumed that the size of the weight vector |J| is
$O(\sqrt{N})$ and that the size of the input vector |x| is 1. This means that on the
order of N input vectors are needed to change J by an amount comparable to its own
size. Consequently, we define the learning iteration m as m = Nt and use the continuous
variable t to represent the learning process.

By using this formulation, a time-dependent differential equation of the student
weight vector length l can be derived. To obtain it, we square both
sides of Eq. (16). By averaging the terms of the result over the distribution P(u,v),
we obtain the differential equation for l. A more explicit derivation is given in the
Appendix.

The differential equation of the direction cosine R is obtained by calculating the
inner product of B and Eq. (16). The differential equation of R is then obtained
through a calculation similar to that used for l. An explicit derivation is again given
in the Appendix. The obtained coupled differential equations are

$$\frac{dl}{dt} = \langle fu \rangle + \frac{\langle f^2 \rangle}{2l}, \tag{20}$$

$$\frac{dR}{dt} = \frac{\langle fv \rangle - \langle fu \rangle R}{l} - \frac{R}{2l^2} \langle f^2 \rangle. \tag{21}$$

The symbol $\langle \cdots \rangle$ means averaging over the distribution P(u,v).

The main purpose of the on-line learning theory described in this paper is to
calculate the generalization error $\epsilon_g$. We can calculate $\epsilon_g$ by using
the order parameters R and l. Thus, we should know the time dependence of the order
parameters R and l to obtain that of the generalization error. The time dependence of
the order parameters R and l is described by Eqs. (20) and (21). To evaluate Eqs. (20)
and (21), we need the statistical averages $\langle fv \rangle$, $\langle fu \rangle$,
and $\langle f^2 \rangle$ with respect to $v = \boldsymbol{B} \cdot \boldsymbol{x}$ and
$ul = \boldsymbol{J} \cdot \boldsymbol{x}$. These averages are calculated for our
algorithm as follows.

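$\langle fv \rangle$ is calculated by averaging the product of Eq. (17) and the total
input of the teacher v over P(u,v). $\langle fu \rangle$ is calculated by averaging the
product of Eq. (17) and the normalized total input of the student u over P(u,v).
$\langle f^2 \rangle$ is calculated by averaging the square of Eq. (17) over P(u,v).
The results are shown in Eqs. (22), (23), and (24).

$$\langle fu \rangle = \sqrt{\frac{2}{\pi}} \left[ R\, H\!\left(-\frac{\kappa}{l\sqrt{1 - R^2}}\right) - \exp\!\left(-\frac{\kappa^2}{2l^2}\right) H\!\left(-\frac{R\kappa}{l\sqrt{1 - R^2}}\right) \right] \tag{22}$$

$$\langle fv \rangle = \sqrt{\frac{2}{\pi}} \left[ H\!\left(-\frac{\kappa}{l\sqrt{1 - R^2}}\right) - R \exp\!\left(-\frac{\kappa^2}{2l^2}\right) H\!\left(-\frac{R\kappa}{l\sqrt{1 - R^2}}\right) \right] \tag{23}$$

$$\langle f^2 \rangle = 2 \int_0^{\infty} Dv\, H\!\left(\frac{Rv - \kappa/l}{\sqrt{1 - R^2}}\right) \tag{24}$$

where

$$Dx = \frac{dx}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right), \tag{25}$$

$$H(u) = \int_u^{\infty} Dx. \tag{26}$$

Substituting $\kappa = 0$ into Eqs. (22) ~ (24) gives $\langle fu \rangle$,
$\langle fv \rangle$, and $\langle f^2 \rangle$ of perceptron learning. Likewise, taking
$\kappa \to \infty$ in Eqs. (22) ~ (24) gives those of Hebbian learning.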
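The coupled equations (20) and (21) can be integrated numerically once the three averages are available. The sketch below is one way to do this under stated assumptions, not the authors' code: the closed forms used for $\langle fu \rangle$ and $\langle fv \rangle$ were worked out for this example by integrating Eq. (17) against the Gaussian P(u,v) of Eq. (14), $\langle f^2 \rangle$ is evaluated by a simple midpoint rule, the Euler step size and the initial values R = 0, l = 1 are illustrative choices, and all function names are hypothetical. A large finite margin should be used for the Hebbian limit, since an infinite one produces 0 · inf in floating point.

```python
import math

def H(x):
    """Gaussian tail integral H(x) of Eq. (26), via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def averages(R, l, kappa, n_grid=400, v_max=8.0):
    """Return (<fu>, <fv>, <f^2>) for the margin rule of Eq. (17); kappa must be finite."""
    a = kappa / l                                    # effective margin kappa / l
    s = math.sqrt(max(1.0 - R * R, 1e-12))
    g = math.exp(-0.5 * a * a)
    c = math.sqrt(2.0 / math.pi)
    fu = c * (R * H(-a / s) - g * H(-R * a / s))
    fv = c * (H(-a / s) - R * g * H(-R * a / s))
    # <f^2> = 2 * int_0^inf Dv H((R v - a) / s), midpoint rule on [0, v_max]
    h = v_max / n_grid
    f2 = 0.0
    for i in range(n_grid):
        v = (i + 0.5) * h
        f2 += math.exp(-0.5 * v * v) * H((R * v - a) / s)
    f2 *= 2.0 * h / math.sqrt(2.0 * math.pi)
    return fu, fv, f2

def learning_curve(kappa, t_end=10.0, dt=0.01, R0=0.0, l0=1.0):
    """Euler integration of Eqs. (20) and (21); returns a list of (t, eps_g)."""
    R, l, t = R0, l0, 0.0
    curve = [(t, math.acos(R) / math.pi)]
    while t < t_end - 1e-12:
        fu, fv, f2 = averages(R, l, kappa)
        dl = fu + f2 / (2.0 * l)                              # Eq. (20)
        dR = (fv - fu * R) / l - R * f2 / (2.0 * l * l)       # Eq. (21)
        l += dt * dl
        R = min(R + dt * dR, 1.0 - 1e-12)
        t += dt
        curve.append((t, math.acos(R) / math.pi))
    return curve

if __name__ == "__main__":
    for kappa in (0.0, 1.0, 1e3):
        print(kappa, learning_curve(kappa, t_end=5.0)[-1])
```

With kappa = 0 the averages reduce to the perceptron values, for example $\langle fu \rangle = -(1 - R)/\sqrt{2\pi}$, and with a very large kappa they approach the Hebbian values $\langle fu \rangle = \sqrt{2/\pi}\,R$, $\langle fv \rangle = \sqrt{2/\pi}$, $\langle f^2 \rangle = 1$, which gives a quick consistency check on the expressions.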

4 Results 

First, we will consider the results for $\kappa = 10$. The generalization error was
calculated analytically by applying the overlap R to Eq. (13). R was obtained by
numerically solving Eqs. (20) and (21). The generalization error curve obtained
through the analytical calculation is shown in Fig. 5(a). In this figure, the
analytical results (solid line labeled "ana") and the numerical simulation results
(dashed line labeled "num") are shown. The input dimension N used in the numerical
simulation was 1000. The two curves are almost identical, and the numerical simulation
results are distributed around the analytical results. One time step corresponds to the
presentation of N learning samples. The results for $\kappa = 0$, which corresponds to
perceptron learning, and for $\kappa \to \infty$, which corresponds to Hebbian learning,
are also shown in Fig. 5(a) for comparison. Because our algorithm represents an
intermediate form between perceptron and Hebbian learning, we expected the learning
dynamics of Eq. (17) to be midway between those of perceptron learning and those
of Hebbian learning. However, the generalization error of our algorithm was lower
than that of either alternative learning method from t = 10 to 200. Therefore, at
an early stage of learning, our algorithm seems to be superior to both perceptron
and Hebbian learning in this respect. The generalization error for each time step
was calculated using Eqs. (16) and (17).

Next, we analyzed how the margin $\kappa$ affected the generalization error. Figures
5(b), 6(a), and 6(b) show the results for $\kappa$ of 1, 100, and 1000. These figures
show results of the analytical calculation and the numerical simulation. As shown,
the generalization error with our algorithm tended to be lower than that of both
perceptron learning and Hebbian learning, particularly at the early stage of learning,
for every margin tested. Therefore, we expect similar behavior with any margin
$\kappa$.

Moreover, we analyzed the margins $\kappa = 10^{-5}$ and $\kappa = 10^{5}$. As we
explained in Sec. 3.1, when the margin is relatively small, our algorithm will perform
as perceptron learning, and when the margin is relatively large, it will perform as
Hebbian learning. In Fig. 7(a) and (b), the analytical results for $\kappa = 10^{-5}$
and $\kappa = 10^{5}$ are shown. The results for perceptron learning and Hebbian
learning are also shown for comparison. In Fig. 7(a), the results for
$\kappa = 10^{-5}$ (solid line labeled "10^{-5}(ana)") are plotted until t = 7500 to
show how closely they match the perceptron's (dashed line labeled "Perceptron"). The
results for $\kappa = 10^{-5}$ and perceptron learning are identical. In Fig. 7(b), the
results for $\kappa = 10^{5}$ are identical to the results for Hebbian learning.
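A direct simulation of the kind compared with the analytical curves can be sketched as follows. This is an illustrative reimplementation under the assumptions of Sec. 2.2 (Gaussian B and J with unit variance, inputs with variance 1/N), not the authors' simulation code; the function name `simulate`, the default dimension, and the seed are choices made for this example. The error is read off from the measured overlap once per unit of time t = m/N.

```python
import math
import random

def simulate(kappa, n=200, t_end=5.0, seed=0):
    """On-line learning with the margin rule of Eq. (17); returns [(t, eps_g), ...]."""
    rng = random.Random(seed)
    B = [rng.gauss(0.0, 1.0) for _ in range(n)]      # teacher
    J = [rng.gauss(0.0, 1.0) for _ in range(n)]      # student, same statistics
    std_x = 1.0 / math.sqrt(n)
    norm_B = math.sqrt(sum(b * b for b in B))
    curve = []
    steps = int(t_end * n)
    for m in range(steps + 1):
        if m % n == 0:                               # record once per time unit
            norm_J = math.sqrt(sum(j * j for j in J))
            R = sum(b * j for b, j in zip(B, J)) / (norm_B * norm_J)
            R = max(-1.0, min(1.0, R))
            curve.append((m / n, math.acos(R) / math.pi))
        x = [rng.gauss(0.0, std_x) for _ in range(n)]
        v = sum(b * xi for b, xi in zip(B, x))       # teacher's total input
        ul = sum(j * xi for j, xi in zip(J, x))      # student's total input
        s = 1.0 if v > 0 else -1.0
        if kappa - ul * s > 0:                       # Theta(kappa - ul * sgn(v))
            J = [j + s * xi for j, xi in zip(J, x)]
    return curve

if __name__ == "__main__":
    for kappa in (0.0, 10.0, math.inf):
        print(kappa, simulate(kappa)[-1])
```

Passing `kappa=0.0` or `kappa=math.inf` gives the perceptron and Hebbian baselines with the same random inputs, which makes the three curves directly comparable. Larger n reduces the scatter around the analytical curve at the cost of run time.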






5 Asymptotic property 

The generalization error of perceptron learning is known to decay asymptotically as
$t^{-1/3}$, and that of Hebbian learning as $t^{-1/2}$. Therefore, we investigated
the asymptotic property of the generalization error of our method in the region of
t > 10000. The results are shown in Table 1.

The asymptotic property with our algorithm was close to $t^{-1/3}$. In the region of
large t, the length of the student weight vector l increased monotonically. Because the
margin $\kappa$ is constant in time, the effective margin $\kappa / l$ converges to
zero in the limit $t \to \infty$. The asymptotic behavior of our algorithm then equals
that of perceptron learning. Thus, we consider the asymptotic property with our
algorithm to be $t^{-1/3}$.
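The source does not state how the exponents in Table 1 were estimated. One common approach, shown here only as an assumed procedure, is a least-squares fit of log(error) against log(t) over the large-t region. The data in the usage example are synthetic, generated from an exact power law, and are not results from the paper.

```python
import math

def fit_exponent(ts, errors):
    """Least-squares slope of log(error) against log(t)."""
    xs = [math.log(t) for t in ts]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

if __name__ == "__main__":
    ts = [1.0e4 * 1.5 ** k for k in range(10)]       # synthetic times, t > 10000
    errors = [0.9 * t ** (-1.0 / 3.0) for t in ts]   # synthetic power-law decay
    print(fit_exponent(ts, errors))
```

On real learning curves the fitted slope depends on the fitting window, so a value such as -0.334 should be read as consistent with -1/3 rather than as an exact exponent.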

6 Adaptive margin control 

The generalization ability of our method became superior to that of Hebbian and
perceptron learning when we introduced the margin into perceptron learning. This
occurred when t was close to the margin $\kappa$; afterwards our method converged
toward the dynamics of perceptron learning. Thus, it is natural to expect that
adjusting the margin with respect to the learning time might keep our method superior
to both Hebbian and perceptron learning. For instance, this could be done by setting
the margin to some small value in the early stage of learning and gradually enlarging
it. We tried to find the optimum value of a for $\kappa = al$ so as to achieve a
generalization error lower than that of Hebbian learning.

In Fig. 8(a), the margin was controlled so that $\kappa = l$. The generalization
ability of our method was improved and was superior to that of Hebbian learning when
1 < t < 100. However, controlling the margin failed for t > 100. Controlling the
margin fully succeeded, though, when we adjusted the margin $\kappa$ to 1.5l
(Fig. 8(b)). In this case, however, the difference in the generalization errors of our
method and Hebbian learning was small. We also investigated the case where
$\kappa = 2l$ (Fig. 9). The generalization error of our method was smaller than that of
Hebbian learning, but the difference between the two methods was smaller than that in
Fig. 8(b).
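With the adaptive choice $\kappa = al$, the length l drops out of the update condition of Eq. (17), so the effective margin $\kappa / l = a$ stays fixed instead of shrinking as l grows. The sketch below illustrates this equivalence; the function names are hypothetical and the code is not taken from the source.

```python
def sgn(z):
    """Sign function: +1 for z > 0, -1 otherwise."""
    return 1.0 if z > 0 else -1.0

def theta(z):
    """Threshold function: 1 for z > 0, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

def f_adaptive(v, u, l, a):
    """Eq. (17) with the adaptively controlled margin kappa = a * l."""
    kappa = a * l
    return theta(kappa - l * u * sgn(v)) * sgn(v)

def f_adaptive_normalized(v, u, a):
    """Equivalent form for l > 0: Theta(a - u*sgn(v)) * sgn(v)."""
    return theta(a - u * sgn(v)) * sgn(v)

if __name__ == "__main__":
    for v, u, l in ((1.0, 0.5, 2.0), (1.0, 1.5, 2.0), (-1.0, -0.5, 3.0)):
        print(f_adaptive(v, u, l, 1.0), f_adaptive_normalized(v, u, 1.0))
```

In this form the rule updates whenever the normalized student input, signed by the teacher's answer, is below a, regardless of how long the weight vector has become.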

7 CONCLUSIONS 

We have described a new learning method that applies the margin $\kappa$ à la Gardner
to perceptron learning. This method corresponds to either Hebbian learning or
perceptron learning depending on the size of $\kappa$. Coupled differential equations
of the order parameters R and l, where R is the overlap of the teacher weight vector B
and the student weight vector J, and l is the length of the student weight vector, have
been derived for our algorithm. Our analytical results show that the generalization
error with our algorithm tends to be lower than that of either Hebbian or perceptron
learning at the early stage of learning over a wide range of $\kappa$. Also, the
asymptotic property of the generalization error with our algorithm was equal to that of
perceptron learning. Moreover, we investigated the effect of margin adaptation and
found that the generalization error of our method was lower than that of Hebbian
learning when we adjusted the margin to $\kappa = 1.5l$. However, the improvement in
the generalization error was small. In our future work, we plan to compare our method
with other learning methods that use an adaptive learning coefficient [7].

A Derivation of differential equations of order parameters R and l

First, we derive Eq. (20). We square both sides of Eq. (16). For simplicity, we
denote f(v,u,l) as f.

$$\boldsymbol{J}^{m+1} \cdot \boldsymbol{J}^{m+1} = \boldsymbol{J}^{m} \cdot \boldsymbol{J}^{m} + f^2\, \boldsymbol{x} \cdot \boldsymbol{x} + 2f\, \boldsymbol{J}^{m} \cdot \boldsymbol{x} \tag{27}$$

From $|\boldsymbol{J}^m| = l^m \sqrt{N}$, $|\boldsymbol{x}| = 1$, and
$u^m l^m = \boldsymbol{J}^m \cdot \boldsymbol{x}$, Eq. (27) becomes

$$N (l^{m+1})^2 = N (l^m)^2 + f^2 + 2 f l^m u^m. \tag{28}$$

Averaging Eq. (28) over the distribution P(u,v) of the teacher's total input v and
the normalized student's total input u and assuming self-averaging for l, we rewrite
Eq. (28) as the next equation. Here, averaging is denoted as $\langle \cdots \rangle$.

$$N (l^{m+1})^2 = N (l^m)^2 + \langle f^2 \rangle + 2 l^m \langle f u^m \rangle. \tag{29}$$

At the thermodynamic limit, Eq. (29) becomes a differential equation. Equation
(29) is rewritten as

$$N (l^{m+1} + l^m)(l^{m+1} - l^m) = \langle f^2 \rangle + 2 l^m \langle f u^m \rangle. \tag{30}$$

We substitute $l^m = l$, $l^{m+1} = l + dl$, $u^m = u$, and $1/N \to dt$, and then
simplify the equation. The next equation is then given and Eq. (20) is derived.

$$2l \frac{dl}{dt} = \langle f^2 \rangle + 2l \langle fu \rangle \tag{31}$$

To derive the differential equation for R, we multiply both sides of Eq. (16) by
B to obtain

$$\boldsymbol{B} \cdot \boldsymbol{J}^{m+1} = \boldsymbol{B} \cdot \boldsymbol{J}^{m} + f\, \boldsymbol{B} \cdot \boldsymbol{x},$$

$$N l^{m+1} R^{m+1} = N l^m R^m + f v. \tag{32}$$

Equation (32) becomes a time-dependent differential equation at the thermodynamic
limit, $N \to \infty$. We substitute $l^m = l$, $l^{m+1} = l + dl$, $R^m = R$, and
$R^{m+1} = R + dR$ and simplify, and then average Eq. (32) over P(u,v) in the same way
as for the derivation of the differential equation for l. Assuming self-averaging for
l and R, we can rewrite Eq. (32) as

$$N (l + dl)(R + dR) = N l R + \langle fv \rangle,$$

$$R \frac{dl}{dt} + l \frac{dR}{dt} = \langle fv \rangle. \tag{33}$$

By substituting Eq. (20) into Eq. (33) and simplifying, we then derive Eq. (21).
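The final substitution can be written out explicitly. The following worked step is a sketch that uses only the relations stated above: the product rule for lR, the differential equation for l, and the notation of the main text.

```latex
\begin{align*}
l\,\frac{dR}{dt}
  &= \langle fv \rangle - R\,\frac{dl}{dt}
   = \langle fv \rangle - R\left(\langle fu \rangle + \frac{\langle f^2 \rangle}{2l}\right), \\
\frac{dR}{dt}
  &= \frac{\langle fv \rangle - \langle fu \rangle R}{l} - \frac{R}{2l^2}\,\langle f^2 \rangle .
\end{align*}
```

The first line isolates the term containing dR/dt and inserts the expression for dl/dt; dividing by l gives the differential equation for the overlap.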
Table 1: Asymptotic property of the learning curve of the proposed method, Hebbian
learning, and perceptron learning.

Learning method      Exponent of t
Hebbian              -0.5
Perceptron           -0.333
$\kappa = 1$         -0.334
$\kappa = 10$        -0.334
$\kappa = 100$       -0.334
$\kappa = 1000$      -0.341

Figure Legends

Figure 1: Network structure of teacher and student perceptrons.

Figure 2: Schematic diagram depicting the relationship between the overlap and the
generalization error: the weight vector of the teacher and that of the student form
an angle of $\varphi$. The teacher output is distinct from the student output on the
arcs depicted as thick lines.

Figure 3: Relationship between the overlap and the generalization error in the inner
potential space. (a) R = 0, (b) R ~ 1.

Figure 4: Effect of using a margin in the proposed method.

Figure 5: Learning curves of three learning rules - the proposed method, Hebbian
learning, and perceptron learning - obtained through analytical solutions. The
margin $\kappa$ was 10 and 1, respectively, for (a) and (b). Numerical solutions
obtained through computer simulation are also shown.

Figure 6: Learning curves of three learning rules - the proposed method, Hebbian
learning, and perceptron learning - obtained through analytical solutions. The
margin $\kappa$ was 100 and 1000, respectively, for (a) and (b). Numerical solutions
obtained through computer simulation are also shown.

Figure 7: Learning curves of three learning rules - the proposed method, Hebbian
learning, and perceptron learning - obtained through analytical solutions. The
margin $\kappa$ was $10^{-5}$ and $10^{5}$, respectively, for (a) and (b). Numerical
solutions obtained through computer simulation are also shown.

Figure 8: Dynamics of the generalization error with an adaptively controlled margin.
(a) $\kappa = l$, (b) $\kappa = 1.5l$.

Figure 9: Dynamics of the generalization error with an adaptively controlled margin.
$\kappa = 2l$.


